What is Data Science?

Drew Conway’s Venn Diagram of Data Science

Fundamentals

  • What defines success in your business?
  • How is this measured?
  • What can serve as a proxy?
  • Do we have data to support this question?

Data Science Basics

  • What decisions are you making?
  • What data have you collected to support that decision making process?
  • Issues we cannot fix (How this all can fail)
    • Your Data is not all in one place (spread out over lots of different systems)
    • Leadership hasn’t bought in
    • Employees haven’t bought in
    • Searching for a data science ‘unicorn’
    • Failing to accept that we aren’t going to be perfect
    • Asking bad/unanswerable questions

Bad Questions

  • What can my data tell me?
  • Tell me what I don’t know
  • How can we increase funding/donations?
  • Needle in a haystack

Good Questions

  • Is it better to build our website like X or like Y?
  • Has public sentiment about our business changed?
  • Is it weird that there is a spike in internet traffic about West Point?
  • What items in our store are bought together?
  • Are there any abnormalities in our data?

Is it better to build our website like…

  • A/B Testing
  • Will changing my website layout impact the number of donations?
  • Hypothesis: Placing button in middle will increase number of donations

Data

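library(dplyr)      # %>%, mutate, group_by, summarise
library(lubridate)  # mdy, month
# Read the click log and parse visit_date (month/day/year text) into a Date
sample.data<-read.csv("click_data.csv")
sample.data<-sample.data %>% mutate(visit_date=mdy(visit_date))
# Mean of clicked_adopt_today for each calendar month
summarized.data<-sample.data %>% group_by(month(visit_date))%>%
  summarise(avg=mean(clicked_adopt_today))
summarized.data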
## # A tibble: 12 x 2
##    `month(visit_date)`   avg
##                  <dbl> <dbl>
##  1                   1 0.197
##  2                   2 0.189
##  3                   3 0.145
##  4                   4 0.15 
##  5                   5 0.258
##  6                   6 0.333
##  7                   7 0.348
##  8                   8 0.542
##  9                   9 0.293
## 10                  10 0.161
## 11                  11 0.233
## 12                  12 0.465

General Technique

  • Randomly assign visitors to one of n web pages
  • Collect number of “successes”
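  • After a designated time period, determine whether a meaningful difference exists

# Two-sample test of proportions: 50 of 200 successes vs. 30 of 200
# correct = FALSE turns off the continuity correction
result<-prop.test(x=c(50,30),n=c(200,200),correct = FALSE)
result$p.value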
## [1] 0.01241933
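
For intuition, the p-value above can be reproduced by hand. With two groups and no continuity correction, prop.test is equivalent to a pooled two-proportion z-test. A minimal sketch in base R, reusing the counts from the call above (the variable names are illustrative, not from the original analysis):

```r
x <- c(50, 30)    # successes in each group (counts from the prop.test call)
n <- c(200, 200)  # visitors in each group

p_hat   <- x / n                  # observed success rates: 0.25 and 0.15
p_pool  <- sum(x) / sum(n)        # pooled rate under "no difference"
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

z       <- (p_hat[1] - p_hat[2]) / se_pool
p_value <- 2 * pnorm(-abs(z))     # two-sided p-value
p_value
```

The z statistic here is 2.5, and the resulting p-value matches result$p.value.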

What do these mean?

  • Remember we are tied to a decision here: does new webpage generate more donations/clicks?

  • If the two webpages really performed the same, a difference this large would show up only about 1.2% of the time, so switching is unlikely to be a waste of time

  • Does not make the decision for you; must factor in cost. If new webpage costs twice as much as old webpage, that should be accounted for.

Confidence Interval Analysis

result$conf.int
## [1] 0.02221634 0.17778366
## attr(,"conf.level")
## [1] 0.95
  • Out of 200 people, we would expect roughly 4 to 36 more to engage with new webpage

  • If each engagement is expected to bring in 1 dollar (on average) and new webpage costs an additional 50 dollars a month to maintain, may not be worth it
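
The cost question can be made concrete with a little arithmetic. The sketch below rebuilds the 95% interval from the same counts, then asks how many visitors per month would need to see the new webpage before the extra engagements cover the extra 50 dollars. The 1 dollar and 50 dollar figures come from the bullet above; monthly traffic is not given in the deck, so the break-even visitor count is the unknown being solved for, and the variable names are illustrative.

```r
x <- c(50, 30)
n <- c(200, 200)
p_hat <- x / n

# Unpooled standard error and 95% interval for the difference in rates
diff_hat <- p_hat[1] - p_hat[2]
se       <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci       <- diff_hat + c(-1, 1) * qnorm(0.975) * se

extra_per_200 <- ci * 200         # extra engagements per 200 visitors

value_per_engagement <- 1         # dollars, from the slide
extra_monthly_cost   <- 50        # dollars per month, from the slide

# Monthly visitors needed to break even at each end of the interval
# (first element uses the lower bound, second the upper bound)
breakeven_visitors <- extra_monthly_cost / (value_per_engagement * ci)
breakeven_visitors
```

At the pessimistic end of the interval the new webpage needs roughly 2,250 visitors a month to pay for itself; at the optimistic end, roughly 280.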

Warnings

Has public sentiment about our business changed?

  • Assumes metric we are interested in is positive tweets that mention WPAOG

  • Hypothesis: If I do X, then people will tweet more positively about WPAOG

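library(readr)     # read_csv
library(dplyr)     # %>%, anti_join, select
library(tidytext)  # unnest_tokens, stop_words

added = read_csv("AOGTweets.csv")

# Split each tweet into one row per word
textcleaned = added %>%
  unnest_tokens(word, text)

# Drop common stop words and keep only the columns used later
cleanedarticle = anti_join(textcleaned,stop_words, by = "word") %>%
  select(line,word,screenname,time,followerscount,favoritescount, searchterm)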

Has Public Sentiment About our Business Changed?

## # A tibble: 10 x 2
##    dayofweek  AverageDailySentiment
##    <date>                     <dbl>
##  1 2019-11-06                0.221 
##  2 2019-11-07                0.0952
##  3 2019-11-08                0.177 
##  4 2019-11-09                0.218 
##  5 2019-11-10                0.210 
##  6 2019-11-11                0.143 
##  7 2019-11-12                0.0115
##  8 2019-11-13                0.198 
##  9 2019-11-14                0.192 
## 10 2019-11-15                0.108
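
The deck does not show how AverageDailySentiment was computed. One common approach is to score each word against a sentiment lexicon, average within a tweet, and then average tweets within a day. The sketch below illustrates that idea in base R; the tweets, the tiny lexicon, and the scoring rule are all invented for illustration and are not the original method.

```r
# Hypothetical tweets and a tiny hand-made lexicon (both invented)
tweets <- data.frame(
  day  = as.Date(c("2019-11-06", "2019-11-06", "2019-11-07")),
  text = c("proud and grateful today",
           "terrible traffic but great game",
           "sad news"),
  stringsAsFactors = FALSE
)
lexicon <- c(proud = 1, grateful = 1, great = 1, terrible = -1, sad = -1)

# Polarity of one tweet: sum of word scores divided by number of words
score_tweet <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  hits  <- lexicon[words]            # NA for words not in the lexicon
  sum(hits, na.rm = TRUE) / length(words)
}

tweets$polarity <- vapply(tweets$text, score_tweet, numeric(1),
                          USE.NAMES = FALSE)

# Average polarity per day
daily <- aggregate(polarity ~ day, data = tweets, FUN = mean)
daily
```

A real analysis would swap in an established lexicon and decide explicitly how to treat retweets, negation, and days with very few tweets.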

Most Influential Tweeters

screenname | text | followerscount | favoritescount | polaritysent | influencescore
WPAOG | Today we salute all veterans. Thank you to all the brave men and women and members of the #LongGrayLine who served and continue to serve our great nation. #DutyHonorCountry https://t.co/0nNNDVzVyT | 15353 | 95 | 0.1793366 | 17.036979
WPAOG | #OTD 12 November-Today is the birthday of Rebecca E. Marier McGuigan (1995), the first woman to be rated the top all-around West Point graduate–in academic, military training & physical trainingsince women first graduated in 1980. She is an active duty Army doctor. #WPAOG150 https://t.co/2IGfg8zSYY | 15353 | 164 | 0.2392355 | 39.234618
WPAOG | This weekend, WPAOG proudly welcomes the Class of 1994 back to West Point for their 25th reunion! We expect more than 380 graduates, family members and guests to return to USMA for the festivities! #WithCourageWeSoar https://t.co/6w5brxAgPp | 15353 | 32 | 0.2930217 | 9.376695

Most Negatively Influential Tweeters

screenname | text | followerscount | favoritescount | polaritysent | influencescore
WPAOG | #OTD 11 Nov 1918, Armistice Day (now Veterans Day in the U.S.) ended World War I, which at the time was The War to End All Wars. Thirty-two West Point graduates were killed in the fighting. #WPAOG150 https://t.co/vTtLOK1wNn | 15353 | 6 | -0.2836056 | -1.7016336
sailingtobyzant | Congratulations to West Point grads in gov. service, except William B Taylor, Jr., Ambassador to Ukraine, who is dishonorably engaged in dishonoring his Commander, President Trump, with triple hearsay to benefit Deep State Anti-American Socialist/Liberal/Communists Democrats. https://t.co/4BWVfz5NVP | 2823 | 1 | -0.1092247 | -0.1092247
AEVanSaun | Spent the morning sharing a few stories about my old roommate Dennis, and then going on a run. Can’t believe it’s been 14 years. #StrongGrayLine @WestPoint_USMA @WPAOG https://t.co/jUO3ltYEwf | 634 | 23 | -0.0057540 | -0.1323428

It’s all relative…

  • Scores are best observed over time
  • Influence score/sentiment are ‘man-made’. Could other things matter?
  • Does this support a decision or is it just nice to know?
  • What does ‘normal’ look like?
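
To see how ‘man-made’ the influence score is: the values in the tweet tables are consistent with influencescore = favoritescount × polaritysent. That formula is inferred from the printed numbers, not stated in the deck, so treat it as an assumption. A quick check in base R against the three most influential tweets:

```r
# Values copied from the "Most Influential Tweeters" table
tweets <- data.frame(
  screenname     = c("WPAOG", "WPAOG", "WPAOG"),
  favoritescount = c(95, 164, 32),
  polaritysent   = c(0.1793366, 0.2392355, 0.2930217)
)

# Assumed formula: favorites weighted by sentiment polarity
tweets$influencescore <- tweets$favoritescount * tweets$polaritysent
tweets
```

Under this reading, follower count plays no role in the score, and a mildly negative tweet with many favorites can outrank a strongly negative tweet with few. Both are design choices worth questioning.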

Is it weird that there is a spike in internet traffic about West Point?

What does normal look like?

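library(lubridate)  # ymd, decimal_date
# Split the monthly series into trend, seasonal, and random components
gah<-decompose(ts(trends$Response,frequency=12,
                  start=decimal_date(ymd("2004-01-01"))))
plot(gah)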

What does normal look like?

## # A tibble: 5 x 3
##   Response dayofweek    err
##      <dbl> <date>     <dbl>
## 1       90 2005-11-01  20.7
## 2       90 2006-09-01  23.4
## 3       87 2016-12-01  22.1
## 4      100 2017-12-01  33.6
## 5       93 2018-12-01  26.5
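
One plausible reading of the err column is the random component left over after removing trend and seasonality; months where it is unusually large are the ‘weird’ ones. The deck does not show that step, so the sketch below is an assumption, run on a synthetic series with one injected spike rather than on the real search-trend data. The 3-standard-deviation cutoff is also an arbitrary illustrative choice.

```r
# Synthetic monthly series (invented): trend + seasonality + one spike
month_index <- 1:120
x <- 50 + 0.1 * month_index + 10 * sin(2 * pi * month_index / 12)
x[60] <- x[60] + 30                       # injected anomaly (Dec 2008)
series <- ts(x, frequency = 12, start = c(2004, 1))

parts <- decompose(series)                # additive decomposition
err   <- as.numeric(parts$random)         # NA for the first and last 6 months

# Flag months whose leftover is far from typical
threshold <- 3 * sd(err, na.rm = TRUE)
flagged   <- which(abs(err) > threshold)
data.frame(month = as.numeric(time(series))[flagged], err = err[flagged])
```

‘Normal’ here means whatever trend plus seasonality predicts, so a December spike that recurs every year would not be flagged, while one that exceeds the usual December bump would.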

What items in our store are bought together?

  • Market Basket Analysis
  • Popular Way to ‘Mine’ Commercial Databases
  • Not running an experiment or fitting a model, instead discovering associations from an existing dataset
  • General Question: “If someone buys \(X\) will they also buy \(Y\)?”
  • Terminology: Support, Confidence, Lift

Market Basket Analysis

  • \(\{Peanut Butter, Jelly\} \to \{Bread\}\)
  • Support Value of .03 means 3% of all baskets had all three things
  • Confidence of .82 means if someone purchased Peanut Butter and Jelly, 82% of the time they also purchase Bread
  • Lift is confidence divided by the probability of buying Bread at all: how much more likely Bread is given Peanut Butter and Jelly than in a random basket
    • If 5% of baskets contain Bread, Lift of rule is .82/.05 = 16.4
  • Easy right? For every set of \(K\) items, \(2^{(K-1)}-1\) possible rules
  • 16 Women’s sweaters on WPAOG online store - 32,767 possible rules; 30 items - 536,870,911 possible rules…
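
The three terms can be checked by hand on a small example. A minimal sketch in base R with invented baskets, using the standard definitions: support is the share of baskets containing every item in the rule, confidence is support divided by the share containing the left-hand side, and lift is confidence divided by the share containing the right-hand side.

```r
# Ten hypothetical baskets (invented for illustration)
baskets <- list(
  c("peanut butter", "jelly", "bread"),
  c("peanut butter", "jelly", "bread"),
  c("peanut butter", "jelly"),
  c("bread"),
  c("bread", "milk"),
  c("milk"),
  c("peanut butter"),
  c("jelly", "milk"),
  c("bread", "peanut butter", "jelly"),
  c("milk")
)

# Share of baskets that contain every item in `items`
share_with <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("peanut butter", "jelly")
rhs <- "bread"

support    <- share_with(c(lhs, rhs))       # baskets with all three items
confidence <- support / share_with(lhs)     # P(bread | peanut butter, jelly)
lift       <- confidence / share_with(rhs)  # relative to P(bread) overall
c(support = support, confidence = confidence, lift = lift)
```

In this toy data the rule has support 0.3, confidence about 0.75, and lift about 1.5. A lift above 1 means Bread shows up more often alongside Peanut Butter and Jelly than it does in baskets overall.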

Apriori Algorithm

  • Set a threshold, discard all ‘rare’ purchases
  • Next look at all pairs of items that survive first pass, discard rare pairs
  • Continue with three-item sets, etc. until dataset is manageable
  • Once dataset is manageable, see which rules have the highest confidence or lift
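
The passes above can be sketched in a few lines of base R. This is a simplified illustration on invented baskets, not the arules implementation: candidates for each pass are built from any items that survived the previous pass, which is looser than the full Apriori pruning rule but follows the same discard-rare-sets-first idea. The threshold and all names are illustrative.

```r
# Hypothetical baskets (invented for illustration)
baskets <- list(
  c("bread", "jelly", "peanut butter"),
  c("bread", "jelly", "peanut butter"),
  c("jelly", "peanut butter"),
  c("bread", "milk"),
  c("bread", "jelly", "peanut butter", "milk"),
  c("milk", "eggs")
)
min_support <- 0.5

# Share of baskets containing every item in `itemset`
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}
is_frequent <- function(itemset) support(itemset) >= min_support

# Pass 1: discard rare single items
items        <- sort(unique(unlist(baskets)))
frequent     <- Filter(is_frequent, as.list(items))
all_frequent <- frequent

# Later passes: build larger sets only from items that survived
k <- 1
while (length(frequent) > 0) {
  survivors <- sort(unique(unlist(frequent)))
  if (length(survivors) < k + 1) break
  candidates   <- combn(survivors, k + 1, simplify = FALSE)
  frequent     <- Filter(is_frequent, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

# Print each surviving itemset
for (s in all_frequent) cat("{", paste(s, collapse = ", "), "}\n")
```

Eggs is dropped in the first pass, so no pair or triple containing eggs is ever counted. That early pruning is what keeps the search from exploding as the number of items grows.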

Example Dataset

Transaction data from 01/12/2010 to 09/12/2011 for a UK-based, registered, non-store online retailer; 541,909 transaction records

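library(arules)  # read.transactions, apriori, inspect

# One basket per line, items separated by commas
tr <- read.transactions('market_basket_transactions.csv', format = 'basket', sep=',')

# Keep rules with at least 3% support and 85% confidence, at most 5 items per rule
association.rules <- apriori(tr, parameter = list(supp=0.03, conf=0.85,maxlen=5),
                             control=list(verbose=FALSE))

inspect(association.rules[1:6])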
##     lhs                                  rhs                                     support confidence      lift count
## [1] {BLUE HAPPY BIRTHDAY BUNTING}     => {PINK HAPPY BIRTHDAY BUNTING}        0.03521610  0.8555556 19.691287   154
## [2] {CANDLEHOLDER PINK HANGING HEART} => {WHITE HANGING HEART T-LIGHT HOLDER} 0.04070432  0.9128205  5.238536   178
## [3] {PINK REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER}    0.05694032  0.9154412 11.914358   249
## [4] {GREEN REGENCY TEACUP AND SAUCER,                                                                              
##      PINK REGENCY TEACUP AND SAUCER}  => {ROSES REGENCY TEACUP AND SAUCER}    0.04962268  0.8714859 10.384218   217
## [5] {PINK REGENCY TEACUP AND SAUCER,                                                                               
##      ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER}    0.04962268  0.9475983 12.332878   217
## [6] {PINK REGENCY TEACUP AND SAUCER,                                                                               
##      REGENCY CAKESTAND 3 TIER}        => {GREEN REGENCY TEACUP AND SAUCER}    0.04504917  0.9336493 12.151334   197

Using Data Effectively to Tell a Story (Tufte)

  • Above all else - show the data
  • Maximize the data-ink ratio
  • Erase non-data-ink
  • Erase redundant data-ink
  • Revise and edit

Data Story

Don’t say just one thing

What’s The Story Here?

Weekends

Holidays

Remove Holidays and Weekends

What else is going on?

September Babies

Becoming an Organization that Effectively Uses Data

  • Starts with the question: why do you want to use data?

  • Data that is not accessible might as well not exist

  • Databasing is boring; cleaning data is boring; but both must be done

  • Don’t try to fix yesterday, find question you want answered today and start collecting relevant data

  • Don’t focus on sledgehammers; smaller tools are usually better

  • Buzz words: Big Data, Deep Learning, AI/ML

  • People are by nature storytellers; use data to tell the story, don’t let data be the story

Thanks!